A block coding method that leads to significantly lower entropy values for the proteins of Haemophilus influenzae and its coding sections
Abstract
A simple statistical block code, combined with the Lempel-Ziv-based compression utilities gzip and compress, significantly increases the level of compression achievable for the proteins encoded in Haemophilus influenzae (hi), the first fully sequenced genome. The method yields an entropy of 3.665 bits per symbol (bps), which is 0.657 bps below the maximum of 4.322 bps (log2 of 20, the number of amino acids). This is an improvement of 0.452 bps over the best previously known value of 4.118 bps, obtained with the lza-CTW algorithm of Matsumoto, Sadakane, and Imai; the next best is 4.143 bps, obtained with Nevill-Manning and Witten's cp algorithm. Using an efficient inverse map from the 20 amino acids to the 61 triplets that code for them, the genome is also found to be compressible, although the gain is smaller. Calculated estimates based on this latter method yield an entropy of 1.757 bps for the coding portions of the genome, and the actual entropy may be lower. Both results follow from applying statistics-based encoding first and substitution-based text compression (gzip, compress) second, and they hint at hitherto unexplored redundancies at both the global and the local level in hi and the proteins it encodes. Further work may help determine whether such decreases in entropy are possible with other proteins and genomes, or whether hi is a rarity (perhaps even unique) in this respect.
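The figures quoted in the abstract can be related to one another with a few lines of arithmetic. The sketch below is illustrative only and is not the paper's block code: it computes the 4.322 bps ceiling for a 20-letter alphabet, a zeroth-order entropy from symbol frequencies, and a bits-per-symbol figure after DEFLATE compression via Python's `zlib` (the same algorithm gzip uses; the `compress` utility's LZW is not in the standard library). The function names and the `AMINO_ACIDS` constant are assumptions introduced here for illustration.

```python
import math
import zlib
from collections import Counter

# Assumption: one-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def max_entropy_bps(alphabet_size: int) -> float:
    """Upper bound on entropy: log2 of the alphabet size, in bits per symbol."""
    return math.log2(alphabet_size)


def empirical_entropy_bps(seq: str) -> float:
    """Zeroth-order entropy estimated from single-symbol frequencies."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def compressed_bps(seq: str, level: int = 9) -> float:
    """Bits per symbol after DEFLATE compression of the ASCII-encoded sequence."""
    data = seq.encode("ascii")
    return 8 * len(zlib.compress(data, level)) / len(seq)


if __name__ == "__main__":
    ceiling = max_entropy_bps(len(AMINO_ACIDS))
    print(f"maximum entropy: {ceiling:.3f} bps")
    # Gap between the ceiling and the value reported in the abstract.
    print(f"gap to reported 3.665 bps: {ceiling - 3.665:.3f} bps")
```

On a real protein file, `compressed_bps` applied to the raw residue string gives a baseline against which a pre-encoding step can be compared; for short inputs the fixed header overhead of the compressed stream inflates the result.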
Similar resources
A Block Coding Method that Leads to Significantly Lower Entropy Values for the Proteins and Coding Sections of Haemophilus influenzae
A simple statistical block code in combination with the LZW-based compression utilities gzip and compress has been found to increase by a significant amount the level of compression possible for the proteins encoded in Haemophilus influenzae, the first fully sequenced genome. The method yields an entropy value of 3.665 bits per symbol (bps), which is 0.657 bps below the maximum of 4.322 bps and...
A Fast Block Size Decision For Intra Coding in HEVC Standard
Intra coding in High Efficiency Video Coding (HEVC) can significantly improve the compression efficiency using 35 intra-prediction modes for 2^N×2^N (N is an integer number ranging from six to two) luma blocks. To find the luma block with the minimum rate-distortion, it must perform 11932 different rate-distortion cost calculations. Although this approach improves coding efficiency compared to th...
Improvement of Large-scale PRP production by Haemophilus influenzae type b, using modified CY medium
Background and Objective: Haemophilus influenzae type b (Hib) is a Gram-negative bacterium and one of the most common causative agents of acute meningitis in infants and children under 5 years old worldwide. The production of Hib capsular polysaccharide polyribosyl ribitol phosphate (PRP) is important for the production of conjugate vaccines against Hib infections. The aim of this study is t...
Vaccine Candidates against Nontypeable Haemophilus influenzae: a Review
Nonencapsulated, nontypeable Haemophilus influenzae (NTHi) remains an important cause of acute otitis and respiratory diseases in children and adults. NTHi bacteria are one of the major causes of respiratory tract infections, including acute otitis media, cystic fibrosis, and community-acquired pneumonia among children, especially in developing countries. The bacteria can also cause chronic dise...